Credit Card Users Churn Prediction

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer will churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

Libraries

Read and Understand Data:

Observations

Observations

Observations:

Data Preprocessing

Observations

Age

Exploratory Data Analysis

Observations

Observations

Observations

Observation

Observations

Observation

Observations

Profile of the customers who attrited most, based on their card type

Insights based on EDA

Outlier Detection

Missing Value Detection & Treatment

There are "Unknown" values in the Education_Level, Marital_Status, and Income_Category columns, which can be treated as missing values. We replace "Unknown" with NaN.
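The replacement step can be sketched as follows (a minimal example on a toy frame; the real dataset has the same three columns, but the row values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank dataset (illustrative values only)
df = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School"],
    "Marital_Status": ["Married", "Single", "Unknown"],
    "Income_Category": ["Unknown", "$40K - $60K", "Less than $40K"],
})

# Treat the "Unknown" placeholder as a missing value in the three columns
cols = ["Education_Level", "Marital_Status", "Income_Category"]
df[cols] = df[cols].replace("Unknown", np.nan)

print(df.isna().sum())
```

After this, the usual missing-value treatment (imputation or dropping) applies to the NaN entries.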

Missing-Value Treatment

Split the dataset

Encoding categorical variables

Model Building

Model evaluation criterion:

Model can make wrong predictions as:

1. Predicting a customer will churn when they do not - loss of resources.

2. Predicting a customer will not churn when they do - loss of income.

Which case is more important?

How do we reduce this loss, i.e., how do we reduce false negatives?
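Since false negatives (missed churners) are the costly case, recall is the metric to optimize. A small sketch with hypothetical labels shows how recall relates to the confusion-matrix counts:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = churned, 0 = retained
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Recall = TP / (TP + FN): the fraction of actual churners we caught
recall = tp / (tp + fn)
print(tp, fn, recall)  # → 2 2 0.5
```

Maximizing recall directly penalizes false negatives, which is why it is used as the scoring criterion throughout the model building below.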

Model Building Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score
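A minimal sketch of this evaluation, assuming `X` and `y` come from the preprocessed bank dataset (a synthetic imbalanced stand-in is used here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the churn features/target (roughly 16% positives,
# mimicking the dataset's imbalance)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.84, 0.16], random_state=42)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=kfold, scoring="recall")
print(scores.mean())
```

Using `scoring="recall"` keeps the cross-validation aligned with the false-negative-focused evaluation criterion above.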

Handling Imbalanced dataset

This is an imbalanced dataset. A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary.

One way to address this is to oversample the minority class. The simplest approach duplicates examples from the minority class in the training dataset prior to fitting a model; this balances the class distribution but does not provide any additional information to the model. Alternatively, new examples can be synthesized from the existing ones. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

Over Sampling

Since the dataset is imbalanced, let's try oversampling using SMOTE and see whether performance improves.
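The mechanics of the simplest (duplication-based) oversampling can be sketched with plain scikit-learn; SMOTE itself, from the separate `imbalanced-learn` package, follows the same pattern via `SMOTE().fit_resample(X, y)` but synthesizes new points instead of duplicating:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Imbalanced toy data: 90 majority (y=0) vs 10 minority (y=1) samples
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

X_maj, X_min = X[y == 0], X[y == 1]
# Duplicate minority rows (sampling WITH replacement) until classes balance
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))  # → [90 90]
```

Note that resampling must be applied only to the training split, never to the test set, or the evaluation would be biased.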

The recall on test data is only 0.48, and the model is overfitting: there is a large discrepancy between the train and test scores. Let's try regularization.

What is Regularization ?

The linear regression algorithm works by selecting coefficients for each independent variable that minimize a loss function. However, if the coefficients are large, they can lead to overfitting on the training dataset, and such a model will not generalize well to unseen test data. This is where regularization helps: regularization shrinks the coefficients towards zero. In simple words, it discourages learning a more complex or flexible model, preventing overfitting.

Main Regularization Techniques

Ridge Regression (L2 Regularization)

Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function.

Lasso Regression (L1 Regularization)

Lasso adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function.

Elastic Net Regression

Elastic net regression combines the properties of ridge and lasso regression. It works by penalizing the model using both the l2-norm and the l1-norm.

Elastic Net Formula: Ridge + Lasso
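The three penalties described above can be written out explicitly (a sketch, with $\lambda$ the regularization strength and $\beta_j$ the model coefficients):

```latex
\text{Ridge (L2):}\quad L = \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j \beta_j^2
\text{Lasso (L1):}\quad L = \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j |\beta_j|
\text{Elastic Net:}\quad L = \sum_i (y_i - \hat{y}_i)^2 + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2
```

Setting $\lambda_1 = 0$ recovers ridge and $\lambda_2 = 0$ recovers lasso, which is the sense in which elastic net is "Ridge + Lasso".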

Regularization on Oversampled dataset

The recall on test data has improved. Let's see whether undersampling can improve it further.

Undersampling

Let's try undersampling and see whether performance differs.
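Undersampling is the mirror image of the oversampling sketch above: instead of duplicating the minority class, majority-class rows are dropped down to the minority size (again, applied to the training split only):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(1)
# Imbalanced toy data: 90 majority (y=0) vs 10 minority (y=1) samples
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

X_maj, X_min = X[y == 0], X[y == 1]
# Drop majority rows (sampling WITHOUT replacement) down to the minority size
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min),
                      random_state=1)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.array([0] * len(X_min) + [1] * len(X_min))
print(X_bal.shape)  # → (20, 3)
```

The trade-off is that undersampling discards potentially useful majority-class examples, which is why both resampling strategies are compared here.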

Logistic Regression on undersampled data

Observation

The model after undersampling generalizes well across the training and test sets. The recall on test after undersampling was better than the recall on test after oversampling. Let's try regularization on top of this, trying all the solvers and different penalties.
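Searching over solvers and penalties can be sketched with a grid search (a minimal example on synthetic data; note that scikit-learn only accepts certain solver/penalty pairings, so the grid is a list of valid combinations):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the (undersampled) churn training data
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.8, 0.2], random_state=7)

# Only valid solver/penalty pairings (e.g. lbfgs does not support l1)
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 1]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.01, 0.1, 1]},
]
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="recall", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Here `C` is the inverse regularization strength, so smaller values mean stronger shrinkage of the coefficients.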

Model Performance Evaluation and Improvement-Logistic Regression

Logistic regression with undersampling gives a well-generalized model and the best recall, at 0.857.

Model building Decision Tree ,Bagging and Boosting

Here I build different models using KFold and cross_val_score with pipelines, then tune the best 3 models using GridSearchCV and RandomizedSearchCV.

Stratified K-Folds cross-validation provides train/validation indices to split the data. It splits the dataset into k consecutive folds (without shuffling by default), keeping the class distribution in each fold the same as in the target variable. Each fold is then used once as a validation set while the remaining k - 1 folds form the training set.
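The class-preserving property can be verified directly (a small sketch with an 80/20 target, mimicking the churn imbalance):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 80 + [1] * 20)   # 80/20 class split, like a churn target
X = np.zeros((100, 2))              # feature values don't affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each 20-sample validation fold keeps the 80/20 ratio exactly
    counts = np.bincount(y[val_idx])
    print(counts)  # → [16 4]
```

This is why stratified folds are preferred over plain KFold on imbalanced targets: every fold sees the same proportion of churners.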

Hyperparameter Tuning

Comparing all models

Conclusion

Business Recommendations & Insights